We study in detail a particular statistical method in genetic case-control analysis, labeled
"genotype-based association", in which the two test results obtained by assuming dominant and
recessive models are combined into one optimal output. This method differs both from the
allele-based association, which artificially doubles the sample size, and from the direct $\chi^2$
test on the 3-by-2 contingency table, which may overestimate the degrees of freedom. We conclude
that the comparative advantage (or disadvantage) of the genotype-based test over the allele-based
test depends mainly on two parameters: the allele frequency difference $\delta$ and the
Hardy-Weinberg disequilibrium coefficient difference $\delta_\epsilon$. Six different situations,
called "phases", characterized by the two $X^2$ test statistics in the allele-based and
genotype-based tests, are well separated in the phase diagram parameterized by $\delta$ and
$\delta_\epsilon$. For the two major groups of phases, a single parameter
$\theta = \tan^{-1}(\delta/\delta_\epsilon)$ achieves an almost perfect phase separation.
We also apply the analytic result to several types of disease models. It is shown that for dominant
and additive models, genotype-based tests are favored over allele-based tests.



1. Introduction 

Genetic association analysis is a major tool in mapping human disease genes. A simple
association study is the case-control analysis, in which individuals with and without the
disease are collected (roughly equal numbers of samples per group for an optimal design),
DNA samples are extracted, and genetic markers are typed. The prototype of a genetic marker is
the two-allele single-nucleotide polymorphism (SNP). If the two alleles are A and a, there
are three possible genotypes: AA, Aa, aa, each consisting of a maternally derived and a
paternally derived copy of an allele. The three genotype frequencies are calculated in the case
(disease) and control (normal) groups, and a strong contrast between the two sets of genotype
frequencies can be used to indicate an association between that marker and the disease.

The statistical analysis in an association study seems simple, mostly the standard
Pearson's $\chi^2$ test in categorical analysis; there are nevertheless subtle differences among
the various approaches. Some people use the 2x3 genotype count table to carry out a test
with the $\chi^2$ distribution with df = 2 degrees of freedom. This method may overestimate the
degrees of freedom if the Hardy-Weinberg equilibrium holds true. Other people use the
allele-based test, where each person contributes two allele counts, and the allele frequency
is compared in a 2x2 allele count table. This approach artificially doubles the sample
size without a theoretical justification. A third approach, which we call "genotype-based"
case-control association analysis, remains faithful to the sample size and does
not overestimate the degrees of freedom.

A genotype-based analysis can be summarized simply. Two Pearson's $\chi^2$ tests are
carried out on two 2x2 count tables: the first is constructed by combining the AA and
Aa genotype counts and keeping the aa genotype column, and the second by combining
the Aa and aa genotype counts and keeping the AA genotype column. If the marker happens to be
the disease gene and A is the mutant allele (a is the wild-type allele), then the first table is
consistent with a dominant disease model, whereas the second is consistent with a recessive
disease model. The two $\chi^2$ tests lead to two p-values, and the smaller one (the more
significant one) is chosen as the final test result.
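
The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this article: the function names and the genotype counts in the usage lines are hypothetical, and the p-value relies on the standard identity for a $\chi^2$ variable with one degree of freedom, $P(X^2 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column carries no signal
    return (n11 * n22 - n12 * n21) ** 2 * total / denom

def chi2_pvalue_df1(x):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def genotype_based_test(case, control):
    """case/control are (AA, Aa, aa) counts; returns (model, chi2, p-value)."""
    (c_AA, c_Aa, c_aa), (k_AA, k_Aa, k_aa) = case, control
    # Dominant table: AA + Aa pooled against aa.
    x_dom = pearson_chi2_2x2(c_AA + c_Aa, c_aa, k_AA + k_Aa, k_aa)
    # Recessive table: AA against Aa + aa pooled.
    x_rec = pearson_chi2_2x2(c_AA, c_Aa + c_aa, k_AA, k_Aa + k_aa)
    # Both tests have df = 1, so the smaller p-value is the larger statistic.
    model, x = ("dominant", x_dom) if x_dom >= x_rec else ("recessive", x_rec)
    return model, x, chi2_pvalue_df1(x)

# Hypothetical counts, for illustration only.
model, chi2, p = genotype_based_test(case=(30, 50, 20), control=(15, 45, 40))
print(model, round(chi2, 3), p)
```

No multiple-testing correction is applied to the selected p-value in this sketch.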

Genotype-based analysis has been used in practice many times without a particular
name and without a theoretical study. In this article, we take a deeper look at
the genotype-based analysis. We will show that the justification for using genotype-based
tests is intrinsically related to the Hardy-Weinberg disequilibrium, but that more than
just a non-zero Hardy-Weinberg disequilibrium coefficient is important.

The article is organized as follows: we first show that there is no advantage in using 
genotype-based test if there is no Hardy-Weinberg disequilibrium; we then examine the 
situation with Hardy-Weinberg disequilibrium, and use the two parameters, the allele fre- 
quency difference and the difference of two Hardy-Weinberg disequilibrium coefficients, 
to construct a phase diagram; the phase diagram is further simplified by using just one pa- 
rameter; our analytic result is illustrated by a real example from the study of rheumatoid 
arthritis; we apply the formula to different disease models; and finally, future work is discussed.

2. No advantage for genotype-based analysis if Hardy-Weinberg equilibrium 
holds true exactly 

In an ideal situation, we assume N case samples and N control samples, and the A allele
frequencies in the case and control groups are $p_1$ and $p_2$ ($q_1 = 1 - p_1$, $q_2 = 1 - p_2$).
On average (or in the asymptotic limit), the allele and genotype counts are listed in Table 1,
where the Hardy-Weinberg equilibrium (HWE) is assumed.

For a 2-by-2 contingency table $\{N_{ij}\}$ ($i, j = 1, 2$), the Pearson's $(O-E)^2/E$ (O for
observed count, and E for expected count) test statistic is:

$$
X^2 = \frac{(N_{11}N_{22} - N_{12}N_{21})^2 \,(N_{11} + N_{12} + N_{21} + N_{22})}
{(N_{11} + N_{12})(N_{21} + N_{22})(N_{11} + N_{21})(N_{12} + N_{22})}
\tag{1}
$$

Using the table elements in Table 1, we can derive

$$
\begin{aligned}
X^2_{\mathrm{allele}} &= \frac{(2N)^4 (p_1 q_2 - p_2 q_1)^2 \, 4N}{(2N)^4 (p_1 + p_2)(q_1 + q_2)}
 = \frac{4N(p_1 - p_2)^2}{(p_1 + p_2)(q_1 + q_2)} \\
X^2_{\mathrm{dom}} &= \frac{N^4 [(p_1^2 + 2p_1 q_1) q_2^2 - (p_2^2 + 2p_2 q_2) q_1^2]^2 \, 2N}
 {N^4 (p_1^2 + 2p_1 q_1 + p_2^2 + 2p_2 q_2)(q_1^2 + q_2^2)}
 = \frac{2N(p_1 - p_2)^2 (q_1 + q_2)^2}{(2 - q_1^2 - q_2^2)(q_1^2 + q_2^2)} \\
X^2_{\mathrm{rec}} &= \frac{N^4 [(q_1^2 + 2p_1 q_1) p_2^2 - (q_2^2 + 2p_2 q_2) p_1^2]^2 \, 2N}
 {N^4 (q_1^2 + 2p_1 q_1 + q_2^2 + 2p_2 q_2)(p_1^2 + p_2^2)}
 = \frac{2N(q_1 - q_2)^2 (p_1 + p_2)^2}{(2 - p_1^2 - p_2^2)(p_1^2 + p_2^2)}
\end{aligned}
\tag{2}
$$

Table 1. Count tables for genotype-based analysis under HWE

           allele count          dominant model                      recessive model
           A         a           AA + Aa                 aa          AA          Aa + aa
case       $2Np_1$   $2Nq_1$     $N(p_1^2 + 2p_1q_1)$    $Nq_1^2$    $Np_1^2$    $N(2p_1q_1 + q_1^2)$
control    $2Np_2$   $2Nq_2$     $N(p_2^2 + 2p_2q_2)$    $Nq_2^2$    $Np_2^2$    $N(2p_2q_2 + q_2^2)$


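To further simplify the notation, let us denote $\delta = p_1 - p_2$ as the allele frequency
difference, $\bar{p} = (p_1 + p_2)/2$ as the averaged A allele frequency across groups, and
$\overline{p^2} = (p_1^2 + p_2^2)/2$ as the average of the squared terms ($\bar{q}$ and
$\overline{q^2}$ are defined similarly). Then Eq. (2) becomes:

$$
\begin{aligned}
X^2_{\mathrm{allele}} &= \frac{N\delta^2}{\bar{p}\,\bar{q}} \\
X^2_{\mathrm{dom}} &= \frac{2N\delta^2\,\bar{q}^2}{\overline{q^2}\,(1 - \overline{q^2})} \\
X^2_{\mathrm{rec}} &= \frac{2N\delta^2\,\bar{p}^2}{\overline{p^2}\,(1 - \overline{p^2})}
\end{aligned}
\tag{3}
$$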

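Since the genotype-based test is determined by the maximum value among $X^2_{\mathrm{dom}}$ and
$X^2_{\mathrm{rec}}$, we would like to prove an inequality between $X^2_{\mathrm{allele}}$ and
$\max(X^2_{\mathrm{dom}}, X^2_{\mathrm{rec}})$.

Towards this aim, we first compare $X^2_{\mathrm{allele}}$ and $X^2_{\mathrm{dom}}$. Due to the
following two inequalities:

$$
\overline{q^2} = \frac{2q_1^2 + 2q_2^2}{4} \ge \frac{2q_1^2 + 2q_2^2 - (q_1 - q_2)^2}{4}
 = \frac{(q_1 + q_2)^2}{4} = \bar{q}^2
$$

$$
2\bar{p}\,\bar{q} = 2\bar{q} - 2\bar{q}^2 = q_1 + q_2 - (q_1 q_2 + \overline{q^2})
 = 1 - \overline{q^2} - (1 - q_1)(1 - q_2) \le 1 - \overline{q^2},
$$

we have

$$
\frac{1}{2\bar{p}\,\bar{q}} \ge \frac{1}{1 - \overline{q^2}}
 \ge \frac{\bar{q}^2}{\overline{q^2}\,(1 - \overline{q^2})},
$$

which leads to $X^2_{\mathrm{allele}} \ge X^2_{\mathrm{dom}}$. A similar approach shows that
$\overline{p^2} \ge \bar{p}^2$ and $2\bar{p}\,\bar{q} \le 1 - \overline{p^2}$, which leads to
$X^2_{\mathrm{allele}} \ge X^2_{\mathrm{rec}}$.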

With the proof that $X^2_{\mathrm{allele}} \ge \max(X^2_{\mathrm{dom}}, X^2_{\mathrm{rec}})$, we
have shown that the allele-based $X^2$ is never smaller, and its p-value never larger, than the
genotype-based one. In other words, if HWE holds exactly true, there is no need to carry out a
genotype-based association analysis. To a certain extent, this result is not surprising, since the
allele-based test utilizes twice the number of samples as the genotype-based test, even though the
latter has the advantage of testing multiple (two) disease models. Clearly, the increase in sample
size more than compensates for the advantage of testing multiple models when HWE is true.
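
The inequality can also be checked numerically. The sketch below evaluates the three asymptotic statistics under HWE (in the averaged notation of this section) on a grid of allele frequencies and counts how often the allele-based statistic falls below the larger genotype-based one. The function name `chi2_hwe` and the grid spacing are illustrative choices, not part of the original analysis.

```python
def chi2_hwe(p1, p2, n=1.0):
    """Asymptotic X^2 for the allele, dominant and recessive tables under HWE."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    delta = p1 - p2
    p_bar, q_bar = (p1 + p2) / 2, (q1 + q2) / 2
    p2_bar, q2_bar = (p1**2 + p2**2) / 2, (q1**2 + q2**2) / 2
    x_allele = n * delta**2 / (p_bar * q_bar)
    x_dom = 2 * n * delta**2 * q_bar**2 / (q2_bar * (1 - q2_bar))
    x_rec = 2 * n * delta**2 * p_bar**2 / (p2_bar * (1 - p2_bar))
    return x_allele, x_dom, x_rec

# Scan allele frequencies 0.02, 0.04, ..., 0.98 in both groups.
grid = [i / 50 for i in range(1, 50)]
violations = 0
for p1 in grid:
    for p2 in grid:
        x_allele, x_dom, x_rec = chi2_hwe(p1, p2)
        # Small tolerance guards against floating-point noise when p1 == p2.
        if x_allele < max(x_dom, x_rec) - 1e-12:
            violations += 1
print("grid points violating the inequality:", violations)
```

Because N multiplies all three statistics, setting `n=1.0` loses no generality for this comparison.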



3. Adding violation of Hardy-Weinberg equilibrium

The result in the previous section does not actually invalidate the genotype-based association
analysis, since HWE in real data is often violated, even if not significantly.

Table 2. Count tables for genotype-based analysis under HWD

           allele count        dominant model                                                recessive model
           A         a         AA + Aa                               aa                      AA                      Aa + aa
case       $2Np_1$   $2Nq_1$   $N(p_1^2 + 2p_1q_1 - \epsilon_1)$     $N(q_1^2 + \epsilon_1)$   $N(p_1^2 + \epsilon_1)$   $N(2p_1q_1 + q_1^2 - \epsilon_1)$
control    $2Np_2$   $2Nq_2$   $N(p_2^2 + 2p_2q_2 - \epsilon_2)$     $N(q_2^2 + \epsilon_2)$   $N(p_2^2 + \epsilon_2)$   $N(2p_2q_2 + q_2^2 - \epsilon_2)$


To characterize a realistic genotype count table, one more parameter besides the allele
frequency is needed: the Hardy-Weinberg disequilibrium coefficient (HWDc). The HWDc
$\epsilon$ is defined as
$\epsilon = P_{AA} - p_A^2 = P_{aa} - p_a^2 = -(P_{Aa} - 2p_a p_A)/2 = P_{aa}P_{AA} - P_{Aa}^2/4$.
For the case and control groups, two HWDc's are used: $\epsilon_1$ and $\epsilon_2$. The three
count tables under HWD are now parameterized in Table 2.
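
As a quick check that the last form of the definition agrees with the others, write the genotype frequencies as $P_{AA} = p_A^2 + \epsilon$, $P_{Aa} = 2p_A p_a - 2\epsilon$, $P_{aa} = p_a^2 + \epsilon$ (the parameterization implied by the first three forms, with $p_A + p_a = 1$). A short worked expansion then gives:

```latex
\begin{aligned}
P_{aa}P_{AA} - \frac{P_{Aa}^2}{4}
  &= (p_a^2 + \epsilon)(p_A^2 + \epsilon) - (p_A p_a - \epsilon)^2 \\
  &= p_A^2 p_a^2 + \epsilon\,(p_A^2 + p_a^2) + \epsilon^2
     - p_A^2 p_a^2 + 2\epsilon\, p_A p_a - \epsilon^2 \\
  &= \epsilon\,(p_A + p_a)^2 = \epsilon .
\end{aligned}
```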

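Applying the definition of $X^2$ in Eq. (1) to the count tables in Table 2 (note that the
allele counts are not affected by HWD), we have

$$
\begin{aligned}
X^2_{\mathrm{allele}} &= \frac{4N(p_1 - p_2)^2}{(p_1 + p_2)(q_1 + q_2)} \\
X^2_{\mathrm{dom,HWD}} &= \frac{N^4 [(p_1^2 + 2p_1q_1 - \epsilon_1)(q_2^2 + \epsilon_2)
 - (p_2^2 + 2p_2q_2 - \epsilon_2)(q_1^2 + \epsilon_1)]^2 \, 2N}
 {N^4 (p_1^2 + 2p_1q_1 + p_2^2 + 2p_2q_2 - \epsilon_1 - \epsilon_2)(q_1^2 + q_2^2 + \epsilon_1 + \epsilon_2)} \\
 &= \frac{2N[(p_1 - p_2)(q_1 + q_2) - (\epsilon_1 - \epsilon_2)]^2}
 {(2 - q_1^2 - q_2^2 - \epsilon_1 - \epsilon_2)(q_1^2 + q_2^2 + \epsilon_1 + \epsilon_2)}
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
X^2_{\mathrm{rec,HWD}} &= \frac{N^4 [(q_1^2 + 2p_1q_1 - \epsilon_1)(p_2^2 + \epsilon_2)
 - (q_2^2 + 2p_2q_2 - \epsilon_2)(p_1^2 + \epsilon_1)]^2 \, 2N}
 {N^4 (q_1^2 + 2p_1q_1 + q_2^2 + 2p_2q_2 - \epsilon_1 - \epsilon_2)(p_1^2 + p_2^2 + \epsilon_1 + \epsilon_2)} \\
 &= \frac{2N[(q_1 - q_2)(p_1 + p_2) - (\epsilon_1 - \epsilon_2)]^2}
 {(2 - p_1^2 - p_2^2 - \epsilon_1 - \epsilon_2)(p_1^2 + p_2^2 + \epsilon_1 + \epsilon_2)}
\end{aligned}
\tag{5}
$$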

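Again, shorthand notations are introduced: $\delta_\epsilon = \epsilon_1 - \epsilon_2$ and
$\bar{\epsilon} = (\epsilon_1 + \epsilon_2)/2$. Eqs. (4) and (5) are rewritten as

$$
\begin{aligned}
X^2_{\mathrm{allele}} &= \frac{N\delta^2}{\bar{p}\,\bar{q}} \\
X^2_{\mathrm{dom,HWD}} &= \frac{2N(\bar{q}\,\delta - \delta_\epsilon/2)^2}
 {(\overline{q^2} + \bar{\epsilon})(1 - \overline{q^2} - \bar{\epsilon})} \\
X^2_{\mathrm{rec,HWD}} &= \frac{2N(\bar{p}\,\delta + \delta_\epsilon/2)^2}
 {(\overline{p^2} + \bar{\epsilon})(1 - \overline{p^2} - \bar{\epsilon})}
\end{aligned}
\tag{6}
$$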

From Eq. (6), it is not clear whether $X^2_{\mathrm{allele}}$ is still larger than
$X^2_{\mathrm{dom,HWD}}$ and $X^2_{\mathrm{rec,HWD}}$. Systematic scanning of the 4-parameter space
$(p_1, p_2, \epsilon_1, \epsilon_2)$ would offer a solution, but the result cannot be displayed in
a 2-dimensional space. In the following, we simplify the display of the "phase diagram" by using
only two (or one) parameters.
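
To make the HWD formulas concrete, the following sketch computes the three statistics in two ways: by applying the 2-by-2 Pearson formula directly to the expected counts under HWD, and by using the closed-form expressions in $\delta$, $\delta_\epsilon$ and the group averages. The function names and the parameter values ($p_1 = 0.3$, $p_2 = 0.2$, $\epsilon_1 = -0.03$, $\epsilon_2 = 0.01$, $N = 100$) are hypothetical, chosen only to show one case in which a genotype-based statistic exceeds the allele-based one.

```python
def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return (n11 * n22 - n12 * n21) ** 2 * total / denom

def chi2_from_tables(p1, p2, e1, e2, n):
    """Apply the 2x2 formula directly to the expected counts under HWD."""
    q1, q2 = 1 - p1, 1 - p2
    x_allele = pearson_chi2_2x2(2*n*p1, 2*n*q1, 2*n*p2, 2*n*q2)
    # Dominant table: (AA + Aa, aa); note p^2 + 2pq - eps = 1 - q^2 - eps.
    x_dom = pearson_chi2_2x2(n*(1 - q1**2 - e1), n*(q1**2 + e1),
                             n*(1 - q2**2 - e2), n*(q2**2 + e2))
    # Recessive table: (AA, Aa + aa); note 2pq + q^2 - eps = 1 - p^2 - eps.
    x_rec = pearson_chi2_2x2(n*(p1**2 + e1), n*(1 - p1**2 - e1),
                             n*(p2**2 + e2), n*(1 - p2**2 - e2))
    return x_allele, x_dom, x_rec

def chi2_closed_form(p1, p2, e1, e2, n):
    """Closed forms in delta, delta_eps and the averaged quantities."""
    q1, q2 = 1 - p1, 1 - p2
    delta, delta_eps = p1 - p2, e1 - e2
    p_bar, q_bar, e_bar = (p1 + p2) / 2, (q1 + q2) / 2, (e1 + e2) / 2
    p2_bar, q2_bar = (p1**2 + p2**2) / 2, (q1**2 + q2**2) / 2
    x_allele = n * delta**2 / (p_bar * q_bar)
    x_dom = (2 * n * (q_bar * delta - delta_eps / 2) ** 2
             / ((q2_bar + e_bar) * (1 - q2_bar - e_bar)))
    x_rec = (2 * n * (p_bar * delta + delta_eps / 2) ** 2
             / ((p2_bar + e_bar) * (1 - p2_bar - e_bar)))
    return x_allele, x_dom, x_rec

params = dict(p1=0.3, p2=0.2, e1=-0.03, e2=0.01, n=100)  # hypothetical values
direct = chi2_from_tables(**params)
closed = chi2_closed_form(**params)
print([round(x, 4) for x in direct])
print([round(x, 4) for x in closed])
```

In this example $\delta > 0$ and $\delta_\epsilon < 0$, and the dominant-model statistic is the largest of the three.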



4. Phase diagram with one and two parameters 

The term "phase diagram" is borrowed from the field of statistical physics. In a typical
diagram used in statistical or chemical physics, phases (e.g. solid, liquid and gas) as well as
phase boundaries (e.g. the melting line) are displayed as a function of physical quantities such
as temperature and pressure. Phase transitions occur at phase boundaries. For our topic, a
phase indicates, for example, whether the allele-based or the genotype-based test leads to a
higher $X^2$ value; or it can indicate whether or not the $X^2$ value leads to a statistically
significant result (e.g. p-value < 0.05). The quantities chosen to mimic temperature or pressure
for our topic should highlight the phase separation and phase transitions.

Eq. (6) provides us a hint that the allele frequency difference between the two groups, $\delta$,
and the HWDc difference, $\delta_\epsilon$, could be good quantities for phase separation. First of
all, $\delta$ directly controls the magnitude of $X^2$, so it should separate "significant phases"
from "insignificant phases". Secondly, the relative magnitude and sign of $\delta$ and
$\delta_\epsilon$ seem to control the difference between $X^2_{\mathrm{allele}}$ and
$X^2_{\mathrm{dom,HWD}}$ or $X^2_{\mathrm{rec,HWD}}$, so they should be good quantities to separate
the "favoring-allele-based-test phase" (when
$X^2_{\mathrm{allele}} > \max(X^2_{\mathrm{dom,HWD}}, X^2_{\mathrm{rec,HWD}})$) from the
"favoring-genotype-based-test phase" (when
$X^2_{\mathrm{allele}} < \max(X^2_{\mathrm{dom,HWD}}, X^2_{\mathrm{rec,HWD}})$).

We carried out the following simulation to construct the phase diagram: 5000 replicates
of case-control datasets with 100 cases and 100 controls were generated (in another simulation,
the sample size was 1000 per group). For each replicate, the three genotypes were randomly chosen,
and then the allele frequency and the Hardy-Weinberg disequilibrium coefficient were determined.
Fig. 1 shows the simulation result parameterized by $\delta_\epsilon$ (x-axis) and $\delta$
(y-axis). Six phases (labeled I-VI) are illustrated using six different colors, within two larger
categories:

• Favoring genotype-based tests (crosses in Fig. 1)

- I. p-values for both genotype- and allele-based tests are < 0.05 (red)

- II. p-values for both genotype- and allele-based tests are > 0.05 (yellow)

- III. p-value for genotype-based test is < 0.05, that for allele-based test is > 0.05 (pink)

• Favoring allele-based tests (circles in Fig. 1)

- IV. p-values for both genotype- and allele-based tests are < 0.05 (purple)

- V. p-values for both genotype- and allele-based tests are > 0.05 (blue)

- VI. p-value for allele-based test is < 0.05, that for genotype-based test is > 0.05 (green)

As can be seen from Fig. 1, the two parameters $\delta$ and $\delta_\epsilon$ do a good job of
separating the six different phases, although minor overlap between phases occurs. The overall
performance of $\delta$ and $\delta_\epsilon$ as phase parameters is satisfactory.
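
The classification into six phases can be sketched as follows. The text does not specify how the genotypes were randomly chosen, so this sketch makes an explicit assumption: genotype frequencies are drawn uniformly from the simplex and the genotype counts are then sampled multinomially. All function names are hypothetical, and the replicate count is reduced to 500 to keep the example quick.

```python
import random
from collections import Counter

CHI2_CRIT_DF1 = 3.841458820694124  # chi-square critical value for p = 0.05, df = 1

def pearson_chi2_2x2(n11, n12, n21, n22):
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return 0.0 if denom == 0 else (n11 * n22 - n12 * n21) ** 2 * total / denom

def random_genotype_counts(n, rng):
    """Assumption: genotype frequencies uniform on the simplex, counts multinomial."""
    weights = [rng.expovariate(1.0) for _ in range(3)]
    draws = rng.choices(("AA", "Aa", "aa"), weights=weights, k=n)
    return draws.count("AA"), draws.count("Aa"), draws.count("aa")

def freq_and_hwdc(counts):
    """Allele frequency of A and HWD coefficient eps = P_AA - p_A^2."""
    n_AA, n_Aa, n_aa = counts
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    return p, n_AA / n - p * p

def classify(case, control):
    """Return (delta, delta_eps, phase), with phase 1..6 standing for I..VI."""
    (a1, b1, c1), (a2, b2, c2) = case, control
    x_allele = pearson_chi2_2x2(2*a1 + b1, b1 + 2*c1, 2*a2 + b2, b2 + 2*c2)
    x_geno = max(pearson_chi2_2x2(a1 + b1, c1, a2 + b2, c2),   # dominant
                 pearson_chi2_2x2(a1, b1 + c1, a2, b2 + c2))   # recessive
    sig_a, sig_g = x_allele > CHI2_CRIT_DF1, x_geno > CHI2_CRIT_DF1
    if x_geno > x_allele:          # favoring the genotype-based test
        phase = 1 if sig_a else (3 if sig_g else 2)
    else:                          # favoring the allele-based test
        phase = 4 if sig_g else (6 if sig_a else 5)
    (p1, e1), (p2, e2) = freq_and_hwdc(case), freq_and_hwdc(control)
    return p1 - p2, e1 - e2, phase

rng = random.Random(0)
points = [classify(random_genotype_counts(100, rng), random_genotype_counts(100, rng))
          for _ in range(500)]
print(sorted(Counter(phase for _, _, phase in points).items()))
```

Plotting `delta` against `delta_eps` with one color per phase would give a diagram of the same type as Fig. 1, though the exact point cloud depends on the sampling assumption above.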

As expected, the magnitude of the p-values is mainly controlled by the y-axis. Smaller
allele frequency differences (smaller $\delta$'s) result in non-significant p-values, and
significant results are located far away from the $\delta = 0$ line. On the other hand,
$\delta_\epsilon$ mainly controls whether the allele-based or the genotype-based test is more
significant. However, $\delta_\epsilon$ by itself is not enough: it acts jointly with $\delta$ to
achieve the phase separation. For the genotype-based test to have a smaller p-value than the
allele-based test, with both smaller than 0.05 (red points in Fig. 1), $\delta_\epsilon$ tends to
have the opposite sign to $\delta$.

The effect of sample size on the phase diagram can be examined by comparing Fig. 1(A)
and Fig. 1(B). Phases II, III, V, VI all shrink in area simply because a larger sample size is
more likely to lead to a replicate with p-value < 0.05. The relative location of the different
phases in Fig. 1 remains the same.

Figure 1. The phase diagram parameterized by $\delta_\epsilon = \epsilon_1 - \epsilon_2$ (x-axis,
labeled "HWDc (case) - HWDc (control)") and $\delta = p_1 - p_2$ (y-axis), where p is the allele
frequency for A and $\epsilon$ is the Hardy-Weinberg disequilibrium coefficient, determined by a
numerical simulation. (A) 100 samples per group with 5000 replicates (5000 points in the plot);
(B) 1000 samples per group with 5000 replicates. Six phases are marked: I. p-value for
genotype-based test is smaller than that for allele-based test (and both p-values are smaller than
0.05) (red cross); II. similar to I, but both p-values are larger than 0.05 (yellow cross);
III. similar to I, but one p-value is smaller than 0.05 and the other larger than 0.05 (pink
cross); IV. p-value for allele-based test is smaller than that for genotype-based test (and both
p-values are smaller than 0.05) (purple circle); V. similar to IV, but both p-values are larger
than 0.05 (blue circle); VI. similar to IV, but one p-value is smaller than 0.05 and the other
larger than 0.05 (green circle). The two dashed lines have angles of 73.125° and -73.125°, and the
solid line has an angle of 95.37°.

Table 3. Count tables of marker genotype for a SNP within the gene PTPN22

              TT     TC     CC      total    $p_T$       $\epsilon$
case          16     245    677     938      0.147655    -0.004744
control       12     221    1168    1401     0.087438    +0.000920
difference                                   0.060217    -0.005664


If we focus on the two major categories (phases I, II, III versus phases IV, V, VI), we
notice that the phase boundaries are radial lines through the origin. This observation leads to
the following phase diagram using a single parameter
$\theta = \tan^{-1}(y/x) = \tan^{-1}(\delta/\delta_\epsilon)$, i.e.,
the angle between a radius and the x-axis. To measure the relative advantage (disadvantage)
of the allele-based test over the genotype-based test, we use the ratio of the two $X^2$'s:
$\Lambda = X^2_{\mathrm{allele}} / \max(X^2_{\mathrm{rec}}, X^2_{\mathrm{dom}})$. Fig. 2 shows
$\Lambda$ as a function of $\theta$, using the simulation results in Fig. 1 (100 samples per group
and 1000 samples per group) and the same color code for the six phases.

Fig. 2 shows that within the range $-13\pi/32 < \theta < 13\pi/32$
($-73.125° < \theta < 73.125°$, or $-3.2966 < \delta/\delta_\epsilon < 3.2966$), the
genotype-based test is favored over the allele-based test. Overlap of phases still occurs in
Fig. 2, indicating that the phase separation is not perfect. The allele-based test is much better
than the genotype-based test when $\theta = \pi/2$ (90°), and the genotype-based test is much
better than the allele-based test when $\theta = 0$ (or $\delta = 0$).

The sample size per group does not affect the phase boundary between the two major
categories, though it does affect the phases within a major category. This observation can be
understood theoretically from the formulas for the $X^2$'s in Eq. (6): the relative magnitude
between $X^2_{\mathrm{allele}}$ and $X^2_{\mathrm{dom,HWD}}$ or $X^2_{\mathrm{rec,HWD}}$ is
independent of N, as N cancels out.
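
The cancellation of N can be illustrated with a short sketch that evaluates the ratio $\Lambda$ and the angle $\theta$ from the closed-form statistics for two sample sizes. The function name and the parameter values are hypothetical. The angle is computed with `atan2` so that it is measured from the positive x-axis ($\delta_\epsilon$) over the full circle, which is an interpretation of the definition of $\theta$ in the text.

```python
import math

def lambda_and_theta(p1, p2, e1, e2, n):
    """Ratio X^2_allele / max(X^2_dom,HWD, X^2_rec,HWD) and angle theta (degrees)."""
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    delta, delta_eps, e_bar = p1 - p2, e1 - e2, (e1 + e2) / 2
    p2_bar = (p1**2 + p2**2) / 2
    q2_bar = ((1 - p1)**2 + (1 - p2)**2) / 2
    x_allele = n * delta**2 / (p_bar * q_bar)
    x_dom = (2 * n * (q_bar * delta - delta_eps / 2) ** 2
             / ((q2_bar + e_bar) * (1 - q2_bar - e_bar)))
    x_rec = (2 * n * (p_bar * delta + delta_eps / 2) ** 2
             / ((p2_bar + e_bar) * (1 - p2_bar - e_bar)))
    lam = x_allele / max(x_dom, x_rec)
    theta = math.degrees(math.atan2(delta, delta_eps))
    return lam, theta

# Hypothetical parameter values; only n differs between the two calls.
lam_100, theta_100 = lambda_and_theta(0.3, 0.2, -0.03, 0.01, n=100)
lam_1000, theta_1000 = lambda_and_theta(0.3, 0.2, -0.03, 0.01, n=1000)
print(round(lam_100, 4), round(lam_1000, 4), round(theta_100, 2))
```

Both calls give the same $\Lambda$ (below 1 here, so the genotype-based test is favored) and the same $\theta$, since neither quantity involves N once the ratio is taken.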

5. Illustration by a real dataset 

The genotype counts of a missense SNP in the gene PTPN22 in rheumatoid arthritis
samples and in control samples are listed in Table 3 (combining the "discovery"
dataset and the "single sib" option in the "replication" dataset in Ref. 3). Our formula
predicts that
$\theta = \tan^{-1}[(0.147655 - 0.087438)/(-0.004744 - 0.000920)] = \tan^{-1}[0.060217/(-0.005664)] = 95.37°$.
This $\theta$ line is marked in both Fig. 1 and Fig. 2 as a solid line, and is within the phase
where the allele-based test is preferred. Our calculation predicts that the allele-based test and
the genotype-based test should lead to similar results$^a$. Indeed,
$X^2_{\mathrm{allele}} = 41.10$, $X^2_{\mathrm{dom,HWD}} = 42.26$, $X^2_{\mathrm{rec,HWD}} = 3.43$,
and the allele-based and genotype-based test statistics are essentially the same.
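
These numbers can be reproduced from the genotype counts in Table 3. The sketch below applies the 2-by-2 Pearson formula of Eq. (1) directly to the observed counts (so the unequal group sizes are handled as they are) and computes $\theta$ with `atan2`; the helper names are illustrative.

```python
import math

def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return (n11 * n22 - n12 * n21) ** 2 * total / denom

def freq_and_hwdc(n_TT, n_TC, n_CC):
    """T allele frequency and HWD coefficient eps = P_TT - p_T^2."""
    n = n_TT + n_TC + n_CC
    p = (2 * n_TT + n_TC) / (2 * n)
    return p, n_TT / n - p * p

case = (16, 245, 677)       # TT, TC, CC counts from Table 3
control = (12, 221, 1168)

p1, e1 = freq_and_hwdc(*case)
p2, e2 = freq_and_hwdc(*control)
theta = math.degrees(math.atan2(p1 - p2, e1 - e2))

(a1, b1, c1), (a2, b2, c2) = case, control
x_allele = pearson_chi2_2x2(2*a1 + b1, b1 + 2*c1, 2*a2 + b2, b2 + 2*c2)
x_dom = pearson_chi2_2x2(a1 + b1, c1, a2 + b2, c2)    # TT + TC versus CC
x_rec = pearson_chi2_2x2(a1, b1 + c1, a2, b2 + c2)    # TT versus TC + CC

print(round(p1, 6), round(e1, 6), round(p2, 6), round(e2, 6))
print(round(theta, 2), round(x_allele, 2), round(x_dom, 2), round(x_rec, 2))
```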



a One difference, however, is that the theoretical calculation is based on an equal number of
samples in the case and control groups. In our example, the sample sizes in the two groups differ.

Figure 2. The $X^2$ ratio $\Lambda = X^2_{\mathrm{allele}} / \max(X^2_{\mathrm{rec}}, X^2_{\mathrm{dom}})$
as a function of the parameter $\theta = \tan^{-1}(\delta/\delta_\epsilon)$ (x-axis, labeled
"theta"). The same color code for the six phases as used in Fig. 1 is also used here. For phases
that favor the genotype-based test, $\Lambda < 1$; for those favoring the allele-based test,
$\Lambda > 1$. (A) 100 samples per group with 5000 replicates; (B) 1000 samples per group with
5000 replicates. The vertical dashed lines correspond to angles of ±73.125° and ±106.875°, and the
solid vertical line corresponds to an angle of 95.37°.



6. Hardy-Weinberg disequilibrium in the patient population given a disease
model

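In the population of patients (case group), a SNP marker within the disease gene or in
linkage disequilibrium with the disease gene usually violates the Hardy-Weinberg equilibrium.
This fact has been used in the proposal of using HWD in case samples to map the disease
gene. The HWD coefficient in the case group can be calculated if the disease model is
given; the calculation is reproduced here. Assuming the penetrances for the AA, Aa, aa genotypes
to be $f_{AA}$, $f_{Aa}$, $f_{aa}$, the disease prevalence is
$K = f_{AA}\,p_1^2 + f_{Aa}\,2p_1q_1 + f_{aa}\,q_1^2$, and the genotype frequencies in the patient
population are

$$
P_{AA,\mathrm{aff}} = \frac{f_{AA}\,p_1^2}{K}, \qquad
P_{Aa,\mathrm{aff}} = \frac{f_{Aa}\,2p_1q_1}{K}, \qquad
P_{aa,\mathrm{aff}} = \frac{f_{aa}\,q_1^2}{K}.
\tag{7}
$$

The HWD coefficient for the case group is

$$
\epsilon_1 = P_{AA,\mathrm{aff}} \cdot P_{aa,\mathrm{aff}} - \frac{P_{Aa,\mathrm{aff}}^2}{4}
 = \frac{p_1^2 q_1^2\,(f_{AA} f_{aa} - f_{Aa}^2)}{K^2},
\tag{8}
$$

and the HWD coefficient for the control group is assumed to be zero ($\epsilon_2 = 0$).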

If the disease model is multiplicative, i.e., $f_{AA}/f_{Aa} = f_{Aa}/f_{aa}$, there is no HWD in
the case group, so HWD cannot be used to map the disease gene. With
$\delta_\epsilon = \epsilon_1 = 0$, from the result in Sec. 2, the allele-based test is favored
over the genotype-based test. For dominant models, $f_{AA} = f_{Aa} = F$, and
$\epsilon_1 \propto F(f_{aa} - F)$. Since we usually assume a low phenocopy rate, i.e.,
$f_{aa} \approx 0$, the HWDc $\epsilon_1 \propto -F^2$ is negative. If the mutant allele A
is enriched in case samples ($\delta = p_1 - p_2 > 0$), then with $\delta_\epsilon < 0$ in
dominant models, we conclude that the genotype-based test is favored over the allele-based test.
For recessive models, $f_{Aa} \approx f_{aa} \approx 0$, so $\epsilon_1 \propto 0$ and the
allele-based test is better. For additive models, $f_{Aa} = f_{aa} + \Delta$ and
$f_{AA} = f_{aa} + 2\Delta$, where $\Delta$ is the contribution to the penetrance from adding one
copy of the mutant allele. Then $\delta_\epsilon$ is equal to
$\epsilon_1 \propto (f_{aa} + 2\Delta) f_{aa} - (f_{aa} + \Delta)^2 = -\Delta^2 < 0$.
Thus the genotype-based test is favored for additive disease models.
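
The sign of $\epsilon_1$ under each disease model can be checked with a small sketch that computes the genotype frequencies among cases from the penetrances and then the HWD coefficient. The function name, the allele frequency, and all penetrance values are hypothetical, chosen only to represent each model type.

```python
def case_hwdc(f_AA, f_Aa, f_aa, p):
    """HWD coefficient among cases, given penetrances and the A allele frequency p."""
    q = 1.0 - p
    K = f_AA * p**2 + f_Aa * 2 * p * q + f_aa * q**2    # disease prevalence
    P_AA = f_AA * p**2 / K                              # genotype frequencies
    P_Aa = f_Aa * 2 * p * q / K                         # among affected individuals
    P_aa = f_aa * q**2 / K
    return P_AA * P_aa - P_Aa**2 / 4

p = 0.2  # hypothetical allele frequency
eps_mult = case_hwdc(0.09, 0.03, 0.01, p)    # multiplicative: f_AA/f_Aa = f_Aa/f_aa
eps_dom = case_hwdc(0.5, 0.5, 0.001, p)      # dominant, low phenocopy rate
eps_rec = case_hwdc(0.5, 0.0, 0.0, p)        # recessive, no phenocopies
eps_add = case_hwdc(0.21, 0.11, 0.01, p)     # additive with Delta = 0.1

for name, eps in [("multiplicative", eps_mult), ("dominant", eps_dom),
                  ("recessive", eps_rec), ("additive", eps_add)]:
    print(f"{name:15s} eps_1 = {eps:+.6f}")
```

With these values the multiplicative and recessive models give $\epsilon_1 = 0$ (up to rounding), while the dominant and additive models give negative $\epsilon_1$, in line with the argument above.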

7. Discussion and future works 

The main point of this article is that the genotype-based test may take advantage of certain
Hardy-Weinberg disequilibrium in case samples to overcome the advantage of the larger sample
size in allele-based tests. Another advantage of the genotype-based test is that it tests
two models and picks the best one. This multiple testing might be corrected by multiplying
the p-value by a factor of 2 (Bonferroni correction), which was not done in this article.
Whether or not to correct for multiple testing is always under debate, but its effect on
our problem is probably to shift the phase boundary slightly.

The $X^2$ test statistic calculations in this article were all carried out assuming an equal
number of samples in the case and control groups. Changing this assumption to unequal numbers of
samples per group is not difficult, but its effect on the conclusion has not been examined.

Here we are addressing the type-I error of the test, the p-value, which is determined by
the $X^2$ test statistic. For the type-II error under the alternative hypothesis, a non-central
$\chi^2$ distribution could usually be used. However, alternatives to the non-central $\chi^2$
distribution for calculating the type-II error and the power have been proposed.